A Corpus-based Contrastive Analysis for Defining Minimal Semantics of Inter-sentential Dependencies for Machine Translation
Abstract
Inter-sentential dependencies, such as those signaled by discourse connectives or pronouns, affect how these items are translated. These dependencies have classically been analyzed within complex theoretical frameworks, often monolingual ones, and the resulting fine-grained descriptions, although relevant to translation, are likely beyond the reach of statistical machine translation systems. Instead, we propose an approach to search for a minimal, feature-based characterization of translation divergencies due to inter-sentential dependencies, in the case of discourse connectives and pronouns, based on contrastive analyses performed on the Europarl corpus. In addition, we show how to automatically assign labels to connectives and pronouns, and how to use them for statistical machine translation.

1. The Need for Inter-sentential Information in Machine Translation

Long-range dependencies are a well-known challenge for machine translation (MT) systems, especially for statistical ones. The correct translation of lexical items such as pronouns often depends on the correct identification of their antecedent. Similarly, the correct translation of multi-functional discourse connectives depends on the correct identification of the rhetorical relation that they convey between two clauses. However, especially when translating between closely related languages, the full disambiguation of such lexical items is sometimes unnecessary for a correct translation. The question that arises is thus how to find the most suitable level of representation for such dependencies, as a trade-off between linguistic accuracy and computational tractability, with the direct aim of improving MT output.

This paper presents a method for finding the minimal semantic/discourse information that must be assigned to two types of lexical items, namely connectives and pronouns, in order to avoid translation mistakes by statistical MT systems.
The method starts from contrastive analyses of a frequently used parallel corpus, Europarl (Koehn, 2005), in order to define and annotate the minimal semantic/discourse information necessary for MT. The paper first describes our analyses and manual annotation methods for disambiguating connectives (Section 2.1) and pronouns (Section 2.2), in the context of English/French MT. Section 3 outlines methods for automatically performing these disambiguation tasks, while Section 4 explains how the automatically labeled linguistic items can be integrated into a statistical MT system. Section 5 concludes the paper and outlines future work.

2. Contrastive Analysis of Two Types of Inter-sentential Dependencies

2.1 Discourse Connectives

Discourse connectives are generally considered indicators of discourse structure, relating two sentences or propositions and making explicit the rhetorical relation between them. Explicit discourse connectives such as because, but, however, since, while, etc., are frequent lexical items and are used to mark rhetorical relations such as Cause or Contrast between units of discourse. Several theoretical frameworks have been proposed for connectives (mainly starting from English ones), such as the Rhetorical Structure Theory (RST) (Mann and Thompson, 1988), or the Segmented Discourse Representation Theory (SDRT) (Asher, 1993). In such theories, more than one hundred possible rhetorical relations have been identified, and complex semantic and logical representations have been used to characterize discourse structure. In a more empirically oriented effort, the Penn Discourse Treebank (PDTB) (Prasad et al., 2008) contains manual annotations of discourse connectives with a large set of labels: for example, the connective while was annotated with 17 possible senses beyond its four main meanings, which are Comparison, Contrast, Concession and Opposition (Miltsakaki et al., 2005).
While a fine-grained characterization provides the necessary theoretical level of linguistic description of discourse structure, it may prove intractable for fully automatic processing. Nevertheless, the disambiguation of at least the main senses of discourse connectives is generally required for their translation, to avoid rendering a wrong sense in translation. For instance, in the following example, the French connective alors que in its contrastive usage is wrongly translated into the English connective so, which signals a causal meaning instead.

FR: Oui, bien entendu, sauf que le développement ne se négocie pas, alors que le commerce, lui, se négocie.
EN: *Yes, of course, but development cannot be negotiated, so that trade can.

To disambiguate connectives for MT, parallel corpora with sense-labeled connectives are required for training and testing. As the PDTB data is in English only, we performed manual annotation on the Europarl corpus. The annotation method, called translation spotting, requires annotators to consider bilingual sentence pairs, and annotate each connective in the source language with its translation in the target language (Meyer et al., 2011). A contrastive analysis showed that these translations can be: a target language connective (in principle signaling the same sense(s) as the source language one), a reformulation with a different syntactic construction, or no connective at all. The indications gained with this method are then used in a second step to manually derive and cluster the minimal semantic and theory-independent labels needed to generate correct translations of a connective.

We exemplify this procedure here for the English connective while. From the Europarl corpus for English-French, we extracted 499 sentences containing the connective while. In 198 cases (43%) the annotators spotted 'no translation' or reformulations of the connective.
In the remaining 301 sentences (57%), the annotators identified the corresponding French connectives. As a second step, the French connectives (signaling the same rhetorical relation(s) as while itself) were manually clustered under the minimal sense labels needed to disambiguate the connective while so that it can be translated correctly from EN to FR. The most frequent French connective clusters and the derived sense labels are the following:

alors que (18%): Contrast/Temporal
si / même si / bien que / s'il est vrai que (25%): Concession
tandis que / mais (9%): Contrast
tant que (2%): Temporal/Causal
pendant (1%): Temporal/Duration
puisque (1%): Temporal/Causal
lorsque (0.8%): Temporal/Punctual

1 The only exception is the case when the ambiguity of a connective is conserved in translation.
2 Source sentence from Europarl, translated by Moses (Koehn et al., 2007) trained on Europarl.
3 These are valid translation problems and will be reconsidered for clustering in future work.

Compared with the PDTB sense hierarchy, for example, the clustered senses for while are as detailed as the PDTB ones on hierarchy level 2, but less detailed than the deepest PDTB level 3. For the temporal meaning of while, however, even more differentiation than PDTB level 3 is needed in order to be able to generate the correct translations.
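
As an illustration only, the clustering step for while can be pictured as a lookup from each spotted French translation to its derived sense label. The sketch below is in Python and is not the authors' tooling: the table entries are copied from the cluster list above, while the names WHILE_SENSE_BY_TRANSLATION, UNLABELED, label_while and label_distribution, as well as the toy input, are hypothetical and introduced only for this example.

```python
from collections import Counter

# French translations of "while" and their sense labels,
# copied from the cluster list in Section 2.1.
WHILE_SENSE_BY_TRANSLATION = {
    "alors que": "Contrast/Temporal",
    "si": "Concession",
    "même si": "Concession",
    "bien que": "Concession",
    "s'il est vrai que": "Concession",
    "tandis que": "Contrast",
    "mais": "Contrast",
    "tant que": "Temporal/Causal",
    "pendant": "Temporal/Duration",
    "puisque": "Temporal/Causal",
    "lorsque": "Temporal/Punctual",
}

# Hypothetical placeholder for 'no translation', reformulations,
# or translations outside the listed clusters.
UNLABELED = "unlabeled"


def label_while(spotted_translation):
    """Map one spotted French translation of 'while' to a sense label."""
    if spotted_translation is None:
        return UNLABELED
    key = spotted_translation.strip().lower()
    return WHILE_SENSE_BY_TRANSLATION.get(key, UNLABELED)


def label_distribution(spotted_translations):
    """Count how often each sense label occurs in a list of spotted translations."""
    return Counter(label_while(t) for t in spotted_translations)


# Invented toy input for demonstration; these are not Europarl annotations.
toy_spots = ["alors que", "Bien que", None, "tandis que", "mais", "en"]
distribution = label_distribution(toy_spots)
```

A plain dictionary is enough here because each listed French connective falls into exactly one cluster. Cases the annotators marked as 'no translation' or as reformulations receive the placeholder label instead of a guessed sense.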